Designing and Evaluating a Russian Tagset
Abstract
This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework; the design decisions aimed to balance parameters important to linguists against the feasibility of detecting and disambiguating them automatically. The final tagset contains about 600 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other...
Similar resources
A Positional Tagset for Russian
Fusional languages have rich inflection. As a consequence, tagsets capturing their morphological features are necessarily large. A natural way to make a tagset manageable is to use a structured system. In this paper, we present a positional tagset for describing morphological properties of Russian. The tagset was inspired by the Czech positional system (Hajič, 2004). We have used preliminary ve...
Towards a reference tagset for Japanese
This is a progress report on ongoing research aimed at proposing a ‘reference’ morphosyntactic part-of-speech tagset for the Japanese language. Such a tagset should be linguistically motivated, explicit, broadly applicable, and computationally tractable. Being well defined, such a tagset should be easily adapted in specific ways (e.g. limited, extended or modified). The author is currently atte...
Designing a Common POS-Tagset Framework for Indian Languages
Research in Parts-of-Speech (POS) tagset design for European and East Asian languages started with a mere listing of important morphosyntactic features in one language and has matured in later years towards hierarchical tagsets, decomposable tags, common framework for multiple languages (EAGLES) etc. Several tagsets have been developed in these languages along with large amount of annotated dat...
Building a Dependency Parsing Model for Russian with MaltParser and MyStem Tagset
The paper describes a series of experiments on building a dependency parsing model using MaltParser, the SynTagRus treebank of Russian, and the morphological tagger Mystem. The experiments have two purposes. The first one is to train a model with a reasonable balance of quality and parsing time. The second one is to produce user-friendly software which would be practical for obtaining quick res...
Evaluating Distributional Properties of Tagsets
We investigate which distributional properties should be present in a tagset by examining different mappings of various current part-of-speech tagsets, looking at English, German, and Italian corpora. Given the importance of distributional information, we present a simple model for evaluating how a tagset mapping captures distribution, specifically by utilizing a notion of frames to capture the ...